bottleneck feature
CORAL: Disentangling Latent Representations in Long-Tailed Diffusion
Rodriguez, Esther, Welfert, Monica, McDowell, Samuel, Stromberg, Nathan, Camarena, Julian Antolin, Sankar, Lalitha
Diffusion models have achieved impressive performance in generating high-quality and diverse synthetic data. However, their success typically assumes a class-balanced training distribution. In real-world settings, multi-class data often follow a long-tailed distribution, where standard diffusion models struggle -- producing low-diversity and lower-quality samples for tail classes. While this degradation is well-documented, its underlying cause remains poorly understood. In this work, we investigate the behavior of diffusion models trained on long-tailed datasets and identify a key issue: the latent representations (from the bottleneck layer of the U-Net) for tail class subspaces exhibit significant overlap with those of head classes, leading to feature borrowing and poor generation quality. Importantly, we show that this is not merely due to limited data per class, but that the relative class imbalance significantly contributes to this phenomenon. To address this, we propose COntrastive Regularization for Aligning Latents (CORAL), a contrastive latent alignment framework that leverages supervised contrastive losses to encourage well-separated latent class representations. Experiments demonstrate that CORAL significantly improves both the diversity and visual quality of samples generated for tail classes relative to state-of-the-art methods.
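The abstract does not spell out CORAL's loss, so the following is only a minimal sketch of a generic supervised contrastive loss applied to a batch of latent vectors. It assumes the bottleneck activations have already been flattened to shape (N, D); the function name, the temperature value, and the NumPy formulation are illustrative choices, not the authors' implementation.

```python
import numpy as np

def supervised_contrastive_loss(latents, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch of latents.

    latents: (N, D) array, e.g. flattened bottleneck activations (assumed).
    labels:  (N,) integer class labels.
    """
    z = np.asarray(latents, dtype=np.float64)
    y = np.asarray(labels)
    # L2-normalise so the dot product is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Log-softmax over all other samples in the batch (shifted for stability).
    sim = sim - sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    # Positives: same label as the anchor, excluding the anchor itself.
    positives = (y[:, None] == y[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0  # anchors with no positive in the batch are skipped
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(-mean_log_prob_pos.mean())
```

The loss is small when same-class latents cluster together and far from other classes, which is the kind of separation the abstract describes as the goal for tail classes.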
Multimodal Federated Learning With Missing Modalities through Feature Imputation Network
Poudel, Pranav, Chhetri, Aavash, Gyawali, Prashnna, Leontidis, Georgios, Bhattarai, Binod
Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns--two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings consistently improve the performance of competitive baselines.
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
Liu, Haofeng, Xu, Chenshu, Yang, Yifei, Zeng, Lihua, He, Shengfeng
Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work, DragDiffusion, updates the diffusion latent map in response to user inputs, causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast, we present DragNoise, offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly, the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly, high-level semantics, established early in the denoising process, show minimal variation in subsequent stages. Leveraging these insights, DragNoise edits diffusion semantics in a single denoising step and efficiently propagates these changes, ensuring stability and efficiency in diffusion editing. Comparative experiments reveal that DragNoise achieves superior control and semantic retention, reducing the optimization time by over 50% compared to DragDiffusion. Our codes are available at https://github.com/haofengl/DragNoise.
Dimensionality Reduction for Improving Out-of-Distribution Detection in Medical Image Segmentation
Woodland, McKell, Patel, Nihil, Taie, Mais Al, Yung, Joshua P., Netherton, Tucker J., Patel, Ankit B., Brock, Kristy K.
Clinically deployed segmentation models are known to fail on data outside of their training distribution. As these models perform well on most cases, it is imperative to detect out-of-distribution (OOD) images at inference to protect against automation bias. This work applies the Mahalanobis distance post hoc to the bottleneck features of a Swin UNETR model that segments the liver on T1-weighted magnetic resonance imaging. By reducing the dimensions of the bottleneck features with principal component analysis, OOD images were detected with high performance and minimal computational load.
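The pipeline described above (PCA on bottleneck features, then a post hoc Mahalanobis distance) can be sketched roughly as follows. This assumes each image's bottleneck features have already been pooled or flattened into one row of an (N, D) array; the function names and the use of a pseudo-inverse are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_pca_mahalanobis(train_feats, n_components):
    """Fit PCA and a Gaussian on in-distribution features of shape (N, D)."""
    x = np.asarray(train_feats, dtype=np.float64)
    mean = x.mean(axis=0)
    # Principal axes come from the SVD of the centred data.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    components = vt[:n_components]             # (k, D)
    reduced = (x - mean) @ components.T        # (N, k), zero-mean by construction
    cov = np.cov(reduced, rowvar=False).reshape(n_components, n_components)
    # Pseudo-inverse guards against a singular covariance (assumed safeguard).
    return mean, components, np.linalg.pinv(cov)

def mahalanobis_score(feats, mean, components, cov_inv):
    """Squared Mahalanobis distance in PCA space; larger means more OOD-like."""
    r = (np.asarray(feats, dtype=np.float64) - mean) @ components.T
    return np.einsum("ij,jk,ik->i", r, cov_inv, r)
```

At inference, a test image would be flagged when its score exceeds a threshold chosen on held-out in-distribution data; how that threshold is set is left open here.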
CNN based Dog Breed Classifier Using Stacked Pretrained Models
In this article, we will learn how to classify images based on their fine details, using a stacked pre-trained model to get maximum accuracy in TensorFlow. Hey folks, I hope you have done some image classification using pre-trained TensorFlow or other CNN models and have some idea of how we classify images. But when it comes to classifying finely detailed objects (dog breeds, cat breeds, leaf diseases), this method doesn't give good results; in this case, we prefer model stacking to capture most of the details. Let's get straight to the technicalities. Our dataset has 120 dog breeds, and we will classify them using a stacked pre-trained model (TensorFlow, Densenet121) trained on ImageNet. We will stack the bottleneck features extracted by these models for greater accuracy, which will depend on the models we stack together.
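Stacking bottleneck features usually means concatenating, per image, the feature vectors produced by each pretrained extractor before training a small classifier on top. Here is a minimal sketch of that step; random arrays stand in for real extracted features, and the feature widths and helper name are assumptions for illustration only.

```python
import numpy as np

def stack_bottleneck_features(feature_sets):
    """Concatenate per-model bottleneck features along the feature axis.

    feature_sets: list of (N, D_i) arrays, one per pretrained extractor,
    all computed on the same N images in the same order.
    """
    n = feature_sets[0].shape[0]
    if any(f.shape[0] != n for f in feature_sets):
        raise ValueError("every extractor must cover the same images")
    return np.concatenate(feature_sets, axis=1)

# Hypothetical stand-ins for features from two pretrained backbones.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(8, 2048))  # assumed width, illustration only
feats_b = rng.normal(size=(8, 1024))  # assumed width, illustration only
stacked = stack_bottleneck_features([feats_a, feats_b])
print(stacked.shape)  # (8, 3072)
```

The stacked array can then be fed to a dense classification head with one output per breed.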
Neural Network based End-to-End Query by Example Spoken Term Detection
Ram, Dhananjay, Miculicich, Lesly, Bourlard, Hervé
This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other. We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of the two stages. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech Task 2014, demonstrating in each case significant improvements. Query-by-example spoken term detection (QbE-STD) is defined as the task of detecting all files from an audio archive which contain a spoken query provided by a user (see Figure 1). It enables users to search through multilingual audio archives using their own speech. The primary difference from keyword spotting is that QbE-STD relies on spoken queries instead of textual queries, making it a language-independent task. In general, the queries and test utterances are generated by different speakers in different languages with varying acoustic conditions and without constraints on vocabulary, pronunciation lexicon, accents etc. Thus, the search is performed relying only on acoustic data of the query and test utterances with no language-specific resources, as a zero-resource task.
It is essentially a pattern matching problem in the context of speech data where the targeted pattern is the information represented using speech signal and given to the system as a spoken query.
Build Deeper: What's in the Book
So, let's see what I've covered in the book. Build Deeper: The Path to Deep Learning. The new book is the successor to my earlier book, Build Deeper: Deep Learning Beginners' Guide (which is why I called this the 'second edition'), to which I've added a lot more topics this time. The new book is more than twice the length of the old book, and covers more breadth and depth in Deep Learning. Here's what you can expect in the book: A detailed explanation of what Deep Learning is, what it isn't, and how it relates to other areas in AI. What Deep Learning has achieved through the years, including recent achievements such as those of OpenAI and DeepMind.
Saliency Supervision: An Intuitive and Effective Approach for Pain Intensity Regression
Li, Conghui, Zhu, Zhaocheng, Zhao, Yuming
Estimating pain intensity from face images is an important problem in autonomous nursing systems. However, due to limited data sources and the subjectivity of pain intensity values, it is hard to apply modern deep neural networks to this problem without domain-specific auxiliary design. Inspired by priors of human vision, we propose a novel approach called saliency supervision, where we directly regularize deep networks to focus on the facial areas that are discriminative for pain regression. By alternating training between saliency supervision and a global loss, our method learns sparse and robust features, which prove helpful for pain intensity regression. We verified saliency supervision with a face-verification network backbone on a widely used dataset, and achieved state-of-the-art performance without bells and whistles. Our saliency supervision is intuitive in spirit, yet effective in performance. We believe such saliency supervision is essential in dealing with ill-posed datasets, and has potential in a wide range of vision tasks.
Almost Zero-Resource ASR-free Keyword Spotting using Multilingual Bottleneck Features and Correspondence Autoencoders
Menon, Raghav, Kamper, Herman, Quinn, John, Niesler, Thomas
We compare features for dynamic time warping based keyword spotting in an almost zero-resource setting. The objective is to support United Nations (UN) humanitarian relief efforts in parts of Africa with severely under-resourced languages. As supervised resource, we restrict ourselves to an easily-compiled small set of isolated keywords. For feature extraction, we integrate a multilingual bottleneck feature extractor (BNF), trained on well-resourced out-of-domain languages, with a correspondence autoencoder (CAE), trained on extremely sparse in-domain data. We find that, on their own, BNFs and CAE features achieve more than 2% absolute performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, with an 11% absolute improvement in ROC AUC over MFCCs and twice as many top-10 retrievals. We conclude that integrating BNFs with the CAE allows both large out-of-domain and sparse in-domain resources to be exploited for improved ASR-free keyword spotting.
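Dynamic time warping over frame-level features, as used for keyword spotting and query-by-example matching in the abstracts above, can be sketched as below. This is a minimal whole-sequence alignment using cosine distance and the standard three-way recursion; practical systems typically use a subsequence or sliding-window variant to locate a keyword inside a longer utterance, and the function name and normalisation here are assumptions rather than details from the papers.

```python
import numpy as np

def dtw_cost(query, utterance):
    """Length-normalised DTW alignment cost between two feature sequences.

    query: (Tq, D) and utterance: (Tu, D) frame-level features, e.g. BNFs.
    Frames are assumed to be non-zero vectors.
    """
    q = np.asarray(query, dtype=np.float64)
    u = np.asarray(utterance, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T                       # (Tq, Tu) cosine distances
    tq, tu = dist.shape
    acc = np.full((tq + 1, tu + 1), np.inf)    # accumulated cost table
    acc[0, 0] = 0.0
    for i in range(1, tq + 1):
        for j in range(1, tu + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return acc[tq, tu] / (tq + tu)
```

A lower cost indicates a better match, so ranking utterances by this cost against a keyword template gives a simple ASR-free detector; swapping MFCCs for BNF or CAE features changes only the inputs, not the matching code.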
Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition
Lian, Zheng, Li, Ya, Tao, Jianhua, Huang, Jian
Automatic emotion recognition is a challenging task. In this paper, we present our effort for the audio-video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2018 challenge, which requires participants to assign a single emotion label to the video clip from the six universal emotions (Anger, Disgust, Fear, Happiness, Sad and Surprise) and Neutral. The proposed multimodal emotion recognition system takes audio, video and text information into account. In addition to handcrafted features, we also extract bottleneck features from deep neural networks (DNNs) via transfer learning. Both temporal classifiers and non-temporal classifiers are evaluated to obtain the best unimodal emotion classification result. Then possibilities are extracted and passed into the Beam Search Fusion (BS-Fusion). We test our method in the EmotiW 2018 challenge and obtain promising results. Compared with the baseline system, there is a significant improvement. We achieve 60.34% accuracy on the testing dataset, which is only 1.5% lower than the winner. This shows that our method is very competitive.